@kylesayrs kylesayrs commented Aug 26, 2025

Purpose

  • Support fully-expressive attention and kv cache quantization
  • Support running kv cache quantization evals with hf transformers

Prerequisites

Changes

New Classes

  • Add hookable attention and kvcache implementations, which are registered to the attention module as submodules
    • QuantizedAttentionImpl injects itself into the model by registering a new attention implementation called ct_hooked_attention and overriding model.config._attn_implementation to be the new implementation name
    • QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention and wrapping the functionality of the original cache
    • Calibration and transform hooks can be added to these modules via the hook functions
      • register_query_hook
      • register_key_hook
      • register_value_hook
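
The hook registration pattern described above can be sketched in plain Python. This is a toy illustration only: `HookedAttentionSketch`, its method signatures, and the list-based states are hypothetical stand-ins, and the real classes in this PR are torch modules that this sketch does not reproduce.

```python
from typing import Callable, Dict, List, Optional, Sequence

# A hook observes a state and may return a replacement (hypothetical signature).
Hook = Callable[[Sequence[float]], Optional[Sequence[float]]]


class HookedAttentionSketch:
    """Toy attention wrapper; names and signatures are illustrative only."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {"query": [], "key": [], "value": []}

    def _register(self, kind: str, hook: Hook) -> Callable[[], None]:
        self._hooks[kind].append(hook)
        # Return a remover, similar in spirit to a removable hook handle.
        return lambda: self._hooks[kind].remove(hook)

    def register_query_hook(self, hook: Hook) -> Callable[[], None]:
        return self._register("query", hook)

    def register_key_hook(self, hook: Hook) -> Callable[[], None]:
        return self._register("key", hook)

    def register_value_hook(self, hook: Hook) -> Callable[[], None]:
        return self._register("value", hook)

    def _apply(self, kind: str, state: Sequence[float]) -> Sequence[float]:
        # Run hooks in registration order; a non-None return replaces the state.
        for hook in self._hooks[kind]:
            result = hook(state)
            if result is not None:
                state = result
        return state

    def forward(self, query, key, value):
        # A real implementation would call the wrapped attention function here;
        # the sketch just returns the hooked states.
        return (
            self._apply("query", query),
            self._apply("key", key),
            self._apply("value", value),
        )


observed_absmax: Dict[str, float] = {}


def calibrate_key(key: Sequence[float]) -> None:
    # Calibration-style hook: record a statistic, leave the state unchanged.
    observed_absmax["key"] = max(abs(x) for x in key)
    return None


def scale_query(query: Sequence[float]) -> List[float]:
    # Transform-style hook: return a modified state.
    return [0.5 * x for x in query]


attn = HookedAttentionSketch()
remove_query_hook = attn.register_query_hook(scale_query)
attn.register_key_hook(calibrate_key)
q, k, v = attn.forward([2.0, -4.0], [1.0, -3.0], [5.0])
```

Calibration hooks only observe and return nothing, while transform hooks return a replacement state, so both kinds can share one registration mechanism.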

Quantization Lifecycle Changes

  • Apply
    • The kv_cache_scheme field of the quantization config is now used to call initialize_hooked_kv_cache
    • Attention modules can now be targeted; initialize_hooked_attention is called on them only when they are explicitly targeted (see is_narrow_match)
    • Remove logic for "merging" kv cache schemes (this doesn't really make any sense, I'm not sure why it was ever included)
  • Initialize
    • Hooked kv cache and attention modules have their quantization parameters initialized by initialize_module_for_quantization
    • The presence of attention or kvcache submodules is what determines whether attention or kv cache only quantization is being applied
  • Serialization
    • QuantizationConfig.from_pretrained was cleaned up with additional comments
    • The kv_cache_scheme field is added if there are any attention modules with a quantization_scheme attached
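
The narrow-match check referenced under Apply can be illustrated with a simplified sketch. It assumes one plausible reading of "narrow": a target matches the module itself but not the module's parent. The function names, the name-to-class dictionary, and the matching rules (exact name, class name, or a `re:`-prefixed regex on the module name) are hypothetical simplifications, not the library's implementation.

```python
import re
from typing import Dict, Iterable


def matches(name: str, class_name: str, target: str) -> bool:
    """Hypothetical matcher: 're:' targets are regexes on the module name,
    other targets must equal the module name or its class name."""
    if target.startswith("re:"):
        return re.match(target[3:], name) is not None
    return target in (name, class_name)


def is_narrow_match_sketch(
    modules: Dict[str, str], targets: Iterable[str], name: str
) -> bool:
    """Return True if some target matches `name` but not its parent module.

    `modules` maps module names to class names (a stand-in for a real model).
    """
    parent = name.rpartition(".")[0]
    for target in targets:
        if not matches(name, modules[name], target):
            continue
        if parent in modules and matches(parent, modules[parent], target):
            # The target also covers the parent, so it is a broad match.
            continue
        return True
    return False


modules = {
    "model.layers.0": "LlamaDecoderLayer",
    "model.layers.0.self_attn": "LlamaAttention",
    "model.layers.0.self_attn.q_proj": "Linear",
}

narrow = is_narrow_match_sketch(
    modules, ["LlamaAttention"], "model.layers.0.self_attn"
)
broad = is_narrow_match_sketch(
    modules, [r"re:model\.layers\.0.*"], "model.layers.0.self_attn"
)
```

Under these assumptions, targeting `LlamaAttention` by class is a narrow match, while a regex that also covers the enclosing decoder layer is not.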

Helpers

  • is_narrow_match is used to check that attention modules are being specifically targeted (rather than targeting all modules in a layer)
  • get_num_attn_heads, get_num_kv_heads, and get_head_dim read the attention values from the model config
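
A rough sketch of what such config helpers might look like is shown below. The `_sketch` function names are hypothetical, the field names (`num_attention_heads`, `num_key_value_heads`, `head_dim`, `hidden_size`) follow common Hugging Face config conventions, and the fallback behavior is an assumption rather than the behavior of the helpers added in this PR.

```python
from types import SimpleNamespace


def get_num_attn_heads_sketch(config) -> int:
    # Assumes the config exposes `num_attention_heads`.
    return config.num_attention_heads


def get_num_kv_heads_sketch(config) -> int:
    # Assumed fallback: without grouped-query attention, kv heads == attn heads.
    num_kv_heads = getattr(config, "num_key_value_heads", None)
    if num_kv_heads is not None:
        return num_kv_heads
    return get_num_attn_heads_sketch(config)


def get_head_dim_sketch(config) -> int:
    # Assumed fallback: derive head_dim from hidden_size when it is not set.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // get_num_attn_heads_sketch(config)


# Illustrative values only; not read from any real checkpoint.
config = SimpleNamespace(
    hidden_size=2048, num_attention_heads=32, num_key_value_heads=8
)
num_heads = get_num_attn_heads_sketch(config)
num_kv_heads = get_num_kv_heads_sketch(config)
head_dim = get_head_dim_sketch(config)
```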

Testing

  • Added tests for is_narrow_match
  • Added tests for added attention and kvcache classes
  • Quantized models
    • kylesayrs/Llama-3.2-1B-Instruct-attention-fp8-head
    • kylesayrs/Llama-3.2-1B-Instruct-attention-nvfp4-head

Evaluation

eval.py
import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    # 3) hf serialized
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        #"max_length": 128000,
    },
    device="cuda",
    # 3/)

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))
compress.py
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs

# Select model and load it.
#model_id = "Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    # config_groups={
    #     "attention": QuantizationScheme(
    #         #targets=["Qwen2Attention"],
    #         targets=["LlamaAttention"],
    #         input_activations=args,
    #     )
    # }
    kv_cache_scheme=args,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + f"-KV-FP8-{args.strategy}-{args.observer}"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

Model | GSM8K
-- | --
nm-testing/Llama-3.1-8B-Instruct | 0.8337
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Tensor | 0.8271
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Head | 0.8354
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Tensor | 0.8321
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Head | 0.8238

@brian-dellabetta brian-dellabetta left a comment

This looks good, though I have a number of questions and minor suggestions.

@dsikka dsikka left a comment

If the goal is to use this generally for kv_cache and attn quantize, can we move the initialize_hooked_attention and initialize_hooked_kv_cache to initialize.py?

I understand we haven't hooked them in yet for those workflows but I think these belong there.

dsikka previously approved these changes Sep 2, 2025

@dsikka dsikka left a comment

do a pass through on any missing docstring, otherwise lgtm.
nice work

Base automatically changed from kylesayrs/transform-simplify-key to main September 8, 2025 18:46
@dsikka dsikka dismissed stale reviews from brian-dellabetta and themself September 8, 2025 18:46

The base branch was changed.

@kylesayrs kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from e224a5d to 05ec17e Compare October 8, 2025 19:20
@kylesayrs kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 8, 2025 19:20
@brian-dellabetta brian-dellabetta left a comment

Following for the most part. A few clarifications, but this makes sense to me

@kylesayrs kylesayrs marked this pull request as draft October 8, 2025 21:06
@kylesayrs kylesayrs force-pushed the kylesayrs/add-attn-head-strat branch from d084c5e to e3f24d4 Compare October 9, 2025 14:19
@kylesayrs kylesayrs changed the base branch from kylesayrs/add-attn-head-strat to main October 9, 2025 18:14
@kylesayrs kylesayrs dismissed brian-dellabetta’s stale review October 9, 2025 18:14

The base branch was changed.

@kylesayrs kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 9, 2025 18:15
Base automatically changed from kylesayrs/add-attn-head-strat to main October 9, 2025 20:11
@kylesayrs kylesayrs marked this pull request as ready for review October 13, 2025 20:41
@kylesayrs

Last nightly worked, but e2e failed due to model storage issues
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/18483826999

@kylesayrs kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from 4cc5ace to 9ead292 Compare October 14, 2025 04:21
@brian-dellabetta brian-dellabetta left a comment

We can resolve the global var thread, I have another new comment we might want to consider in a follow-up but marking this as approved. Cool stuff! Excited to see it in action

@dsikka dsikka left a comment

Just some questions. Otherwise, LGTM

@dsikka dsikka left a comment

For the sake of completeness, do you mind adding your kv_cache and attn quantized sample models to this PR description?

Signed-off-by: Kyle Sayers <[email protected]>
@brian-dellabetta brian-dellabetta left a comment

impressive work!

@kylesayrs kylesayrs merged commit e88e7d4 into main Oct 23, 2025
3 checks passed
@kylesayrs kylesayrs deleted the kylesayrs/r3-only branch October 23, 2025 14:35
dsikka pushed a commit to vllm-project/llm-compressor that referenced this pull request Oct 24, 2025
## Purpose ##
* Support fully-expressive attention and kv cache quantization
* Support running kv cache quantization evals with hf transformers
* Resolves #1949
* Resolves #1928

```python3
recipe = QuantizationModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=["LlamaAttention"],
            input_activations=QuantizationArgs(
                num_bits=8, type="float", strategy="tensor"
            ),
        )
    }
)
```

```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": null,
        "input_activations": {
          "dynamic": false,
          "num_bits": 8,
          "observer": "minmax",
          "strategy": "tensor",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "LlamaAttention"
        ],
        "weights": null
      }
    },
    "format": "dense",
    "ignore": [],
    "kv_cache_scheme": {
      "dynamic": false,
      "group_size": null,
      "num_bits": 8,
      "observer": "minmax",
      "strategy": "tensor",
      "symmetric": true,
      "type": "float"
    },
    "quant_method": "compressed-tensors",
    "quantization_status": "frozen"
  }
}
```
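
As a small illustration, a serialized config of this shape can be inspected with the standard `json` module. The snippet below embeds a trimmed copy of the config above; the `endswith("Attention")` check is a hypothetical heuristic for spotting attention targets, not how the library detects them.

```python
import json

# Trimmed copy of the serialized config shown above.
config_text = """
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": false,
          "num_bits": 8,
          "strategy": "tensor",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": ["LlamaAttention"],
        "weights": null
      }
    },
    "kv_cache_scheme": {
      "dynamic": false,
      "num_bits": 8,
      "strategy": "tensor",
      "symmetric": true,
      "type": "float"
    },
    "quant_method": "compressed-tensors",
    "quantization_status": "frozen"
  }
}
"""

quant_config = json.loads(config_text)["quantization_config"]

# Hypothetical heuristic: treat targets ending in "Attention" as attention modules.
attention_groups = [
    name
    for name, group in quant_config["config_groups"].items()
    if any(target.endswith("Attention") for target in group["targets"])
]
kv_cache_scheme = quant_config.get("kv_cache_scheme")
```

Here the attention group and the `kv_cache_scheme` entry carry the same 8-bit float, per-tensor settings.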

## Prerequisites ##
* Must be merged at the same time as
vllm-project/compressed-tensors#436

## Changes ##
* Replace hooks
  * Remove `calibrate_kv_cache_input_hook`, `calibrate_kv_cache_output_hook`, `initialize_quantized_kv_cache`
  * Add `calibrate_query_hook`, `calibrate_key_hook`, `calibrate_value_hook`
  * QuantizationMixin now initializes "q", "k", and "v" observers ([depending on the attached submodules](https://github.com/vllm-project/llm-compressor/pull/1651/files#diff-33303ae48e185b2fbb14dc45c2052805837deb5723248367b9579321c4c4e974R263-R270)) and adds the appropriate hooks

* Miscellaneous
  * Fix minor shape bug in `_flatten_attention`
  * Add support for "attn_head" strategy in `_flatten_attention`

* Tests
  * Removed old QuantizationKVCache tests (these classes are now tested [here](https://github.com/neuralmagic/compressed-tensors/pull/436/files#diff-6e33ff48047dc4f7c9d969293f87e32e4d5ec3f3e8b741ea757780c8c0aab775))
  * Updated scale names to avoid using enum
  * Avoid unnecessary tokenization to reduce runtime
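
To illustrate the difference between the "tensor" and "attn_head" strategies, here is a minimal sketch of scale computation. It assumes a symmetric abs-max observer and the FP8 E4M3 maximum finite value of 448; the nested-list layout `[head][token][head_dim]` and the function names are hypothetical, and this is not the logic of `_flatten_attention`.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

# Hypothetical layout: states[head][token][head_dim]
States = List[List[List[float]]]


def absmax(values: List[float]) -> float:
    return max(abs(x) for x in values)


def tensor_scale(states: States) -> float:
    """One scale shared by every head ("tensor" strategy)."""
    flat = [x for head in states for token in head for x in token]
    return absmax(flat) / FP8_E4M3_MAX


def attn_head_scales(states: States) -> List[float]:
    """One scale per attention head ("attn_head" strategy)."""
    return [
        absmax([x for token in head for x in token]) / FP8_E4M3_MAX
        for head in states
    ]


states = [
    [[0.5, -1.0], [0.25, 0.75]],  # head 0: small magnitudes
    [[112.0, -224.0], [56.0, 448.0]],  # head 1: contains a large outlier
]

per_tensor = tensor_scale(states)
per_head = attn_head_scales(states)
```

With a single per-tensor scale, the outlier in head 1 sets the scale for head 0 as well; per-head scales let head 0 use a much finer quantization step.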

## Testing ##
* Kv cache regression tests pass
* Able to quantize attention with scripts (will add to examples once
loadable in vllm)
  * kylesayrs/Llama-3.2-1B-Instruct-attention-fp8-head
  * kylesayrs/Llama-3.2-1B-Instruct-attention-nvfp4-head
* Nightly passes (in progress)

## Evaluation ##
<details><summary>eval.py</summary>

```python
import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    # 3) hf serialized
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        #"max_length": 128000,
    },
    device="cuda",
    # 3/)

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))
```

</details>
    
<details><summary>compress.py</summary>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs

# Select model and load it.
#model_id = "Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    # config_groups={
    #     "attention": QuantizationScheme(
    #         #targets=["Qwen2Attention"],
    #         targets=["LlamaAttention"],
    #         input_activations=args,
    #     )
    # }
    kv_cache_scheme=args,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + f"-KV-FP8-{args.strategy}-{args.observer}"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

</details>

Model | GSM8K
-- | --
nm-testing/Llama-3.1-8B-Instruct | 0.8337
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Tensor | 0.8271
nm-testing/Llama-3.1-8B-Instruct-KV-FP8-Head | 0.8354
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Tensor | 0.8321
nm-testing/Llama-3.1-8B-Instruct-QKV-FP8-Head | 0.8238

---------

Signed-off-by: Kyle Sayers <[email protected]>